[Feature] add traceback to error logs and optimize trace log #7608

Open
xyxinyang wants to merge 1 commit into PaddlePaddle:develop from xyxinyang:dev-log-v2

xyxinyang (Collaborator) commented Apr 24, 2026

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

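This work optimizes FastDeploy's logging system and is planned as 4 PRs.

| PR | Content | Status |
| --- | --- | --- |
| 1 | Add logging-related parameters; also print errors to the terminal | Merged |
| 2 | Split log channels; split request.log by level and aggregate it | Merged |
| 3 | Consolidate and simplify worker_process.log, cache_manager.log, and paddle logs | Merged |
| 4 | Standardize and consolidate trace.log | This PR |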

Modifications

1. Add traceback to error logs

  • Add traceback.format_exc() to the log_request_error.error() calls in try/except blocks
  • Touches 20+ files, so that the full call stack is visible whenever an exception occurs
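
The pattern described above can be sketched with the standard library alone. This is a minimal illustration, not code from the PR: the logger name `fd_error_demo` and the function `parse_max_tokens` are hypothetical, and FastDeploy's own logging helpers are not reproduced here.

```python
import logging
import traceback

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("fd_error_demo")


def parse_max_tokens(raw):
    """Parse a request field; on failure, log the error with its full call stack."""
    try:
        return int(raw)
    except ValueError as e:
        # traceback.format_exc() returns the active exception's traceback as a string,
        # so the log line carries the call stack instead of only the message.
        logger.error(f"failed to parse max_tokens={raw!r}: {e}, {traceback.format_exc()}")
        return None
```

The standard library also offers `logger.exception(...)` and `logger.error(..., exc_info=True)`, which attach the same traceback without formatting it by hand; embedding `traceback.format_exc()` in the message keeps the stack inside the message string itself.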

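2. trace.log optimization

  • Add 2 cache events (constants.py, prefix_cache_manager.py):
    • CACHE_HIT - Prefix Cache hit; explains why a request's TTFT is low (cached blocks are reused, so part of Prefill is skipped)
    • CACHE_MISS - Prefix Cache miss; explains why a request's TTFT is high (a full Prefill is required)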
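
A rough sketch of how a hit/miss decision could map onto the two events is shown below. Only the event names `CACHE_HIT` and `CACHE_MISS`, the `matched_block_num > 0` condition, and the three-argument call shape of `trace_print` come from this PR page; the enum body, the `trace_print` stand-in, `TRACE_EVENTS`, and `record_prefix_cache_result` are assumptions made for illustration and do not reproduce FastDeploy's implementation.

```python
from enum import Enum


class LoggingEventName(Enum):
    # Stand-in enum: only these two member names are taken from the PR.
    CACHE_HIT = "CACHE_HIT"
    CACHE_MISS = "CACHE_MISS"


# Hypothetical in-memory sink so the sketch is self-contained.
TRACE_EVENTS = []


def trace_print(event, request_id, user):
    """Hypothetical stand-in for the real trace helper: record one trace event."""
    TRACE_EVENTS.append((event.value, request_id, user))


def record_prefix_cache_result(matched_block_num, req_id, user=""):
    """Emit CACHE_HIT when at least one cached block matched, otherwise CACHE_MISS."""
    if matched_block_num > 0:
        event = LoggingEventName.CACHE_HIT
    else:
        event = LoggingEventName.CACHE_MISS
    trace_print(event, req_id, user)
    return event
```

With one event per request, a trace reader can correlate a low TTFT with `CACHE_HIT` and a high TTFT with `CACHE_MISS` for the same request ID.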

3. Cleanup

  • Remove the unused FD_TRACE environment variable (envs.py)

Usage or Command

Usage is unchanged.

Accuracy Tests

N/A

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot Bot commented Apr 24, 2026

Thanks for your contribution!

PaddlePaddle-bot

This comment was marked as outdated.

codecov-commenter commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 87.27273% with 7 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@d92cad9). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/engine/register_manager.py | 0.00% | 2 Missing ⚠️ |
| fastdeploy/worker/input_batch.py | 33.33% | 2 Missing ⚠️ |
| fastdeploy/cache_manager/cache_messager.py | 0.00% | 1 Missing ⚠️ |
| fastdeploy/cache_manager/prefix_cache_manager.py | 75.00% | 1 Missing ⚠️ |
| fastdeploy/engine/async_llm.py | 87.50% | 1 Missing ⚠️ |
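
The patch figures in this report are mutually consistent, which a few lines of arithmetic can confirm. The sketch below assumes that patch coverage is covered lines divided by total changed lines; the totals of 55 and 48 are inferred from the reported numbers, not stated in the report.

```python
# Reported by Codecov: 87.27273% patch coverage with 7 lines missing.
missing_lines = 7
patch_coverage = 0.8727273

# Inferred (assumption: coverage = covered / total changed lines).
total_lines = round(missing_lines / (1 - patch_coverage))
covered_lines = total_lines - missing_lines

# Per-file "Missing" counts from the table should add up to the headline figure.
per_file_missing = {
    "fastdeploy/engine/register_manager.py": 2,
    "fastdeploy/worker/input_batch.py": 2,
    "fastdeploy/cache_manager/cache_messager.py": 1,
    "fastdeploy/cache_manager/prefix_cache_manager.py": 1,
    "fastdeploy/engine/async_llm.py": 1,
}
missing_from_table = sum(per_file_missing.values())

print(total_lines, covered_lines, missing_from_table)
```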
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7608   +/-   ##
==========================================
  Coverage           ?   72.31%           
==========================================
  Files              ?      419           
  Lines              ?    57907           
  Branches           ?     9089           
==========================================
  Hits               ?    41877           
  Misses             ?    13175           
  Partials           ?     2855           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 72.31% <87.27%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

PaddlePaddle-bot commented Apr 27, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-04-27 17:10:31

This CI report is generated from the following code (refreshed every 30 minutes):


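1 Task overview

❌ 1 Required task has failed and 8 Required tasks are still running; evaluation has to wait until they finish.

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 36(0) | 36 | 24 | 1 | 9 | 2 | 0 |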

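2 Task status summary

2.1 Required tasks: 1/10 passed

Required tasks block merging; failures should be handled first.

| Status | Task | Duration | Root cause | Suggested fix | Log | Rerun |
| --- | --- | --- | --- | --- | --- | --- |
| ❌ | Approval | 9s | PR issue: changes to logging behavior and envs.py require approval from specific RDs, and that approval is missing | Ask zyyzghb, jiangjiajun, and the other listed approvers to approve the PR | Job | - |
| ⏳ | run_tests_with_coverage | - | Running | - | - | - |
| ⏳ | run_ce_cases | - | Running | - | - | - |
| ⏳ | base_tests | - | Running | - | - | - |
| ⏳ | run_tests_logprob | - | Running | - | - | - |
| ⏳ | run_4_cards_tests | - | Running | - | - | - |
| ⏳ | stable_tests | - | Running | - | - | - |
| ⏳ | run_xpu_4cards_cases | - | Running | - | - | - |
| ⏳ | run_xpu_8cards_cases | - | Running | - | - | - |
| ✅ | The remaining 1 Required task passed | - | - | - | - | - |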

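2.2 Optional tasks: 23/26 passed

Optional tasks do not block merging; failures are for reference only.

| Status | Task | Duration | Log | Rerun |
| --- | --- | --- | --- | --- |
| ⏳ | Trigger Jenkins for PR | - | - | - |
| ⏸️ | CI_HPU | - | - | - |
| ⏸️ | Run iluvatar Tests / run_iluvatar_cases | - | - | - |
| ✅ | The remaining 23 optional tasks passed | - | - | - |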

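3 Failure details (Required only)

Approval: PR approval process (confidence: high)

Root cause:
The PR changes logging behavior (it adds traceback.format_exc() to many logger.error calls) and modifies fastdeploy/envs.py, which triggers the repository's approval check. The check script scripts/check_approval.sh found 2 unmet approval requirements and exited with code 6.

Key log: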

Detected log modification in diff:
0. You must have one FastDeploy RD (Jiang-Jia-Jun(jiangjiajun), yuanlehome(liuyuanle), rainyfly(chenjian26), Wanglongzhi2001(wanglongzhi)) approval for modifying [fastdeploy/envs.py].
1. You must have one FastDeploy RD (xyxinyang(zhouchong), zyyzghb(zhangyongyue)) approval for modifying logging behavior (.info/.debug/.error/log_request).
There are 2 approved errors.
Process completed with exit code 6.

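Suggested fix:

  1. For rule 0 (changes to fastdeploy/envs.py): ask any one of jiangjiajun, liuyuanle, chenjian26, or wanglongzhi to approve the PR.
  2. For rule 1 (changes to logging behavior): ask zyyzghb (zhangyongyue) to approve the PR (note: the PR author, xyxinyang, cannot approve their own PR).

Related changes: the PR, titled "add traceback to error logs and optimize trace log", adds traceback.format_exc() to a large number of logger.error calls, which triggers the approval rule for logging-behavior changes. It also modifies fastdeploy/envs.py, which triggers an additional approval rule.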

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-04-28 11:01:59

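📋 Review summary

PR overview: systematically adds traceback information to FastDeploy error logs (20+ files), adds the CACHE_HIT/CACHE_MISS trace events, and removes the obsolete FD_TRACE environment variable.

Change scope: cache_manager/, engine/, entrypoints/, envs.py, trace/constants.py

Impacted tags: [KVCache] [Engine] [APIServer] [FDConfig]

📝 PR convention check

The title uses an official, valid tag, [Feature], which is acceptable. In substance, [Optimization] fits this change better (a logging enhancement rather than a new feature); this is offered for the author's reference. Two Checklist items that do not apply are checked; under rule D3 they should simply be deleted.

Suggested title (can be copied as-is):

  • [Optimization] add traceback to error logs and optimize trace log

Suggested PR description (can be copied as-is; it must reproduce the full structure of the checklist §D2 template):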

## Motivation

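This PR optimizes FastDeploy's logging system (4th PR in the series): it adds traceback information to error logs in try blocks so that the full call stack is visible when an exception occurs, and it standardizes trace log events by adding Prefix Cache hit/miss events.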

## Modifications

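1. **Add traceback to error logs**: add `traceback.format_exc()` to the `log_request_error` / `.error()` calls inside try blocks across 20+ files, so that the full call stack is visible when an exception occurs
2. **Add 2 cache events to trace.log** (`constants.py`, `prefix_cache_manager.py`):
   - `CACHE_HIT` - Prefix Cache hit; explains why a request's TTFT is low (cached blocks are reused, so part of Prefill is skipped)
   - `CACHE_MISS` - Prefix Cache miss; explains why a request's TTFT is high (a full Prefill is required)
3. **Cleanup**: remove the unused `FD_TRACE` environment variable (`envs.py`); remove the duplicate `print()` call in `metrics/trace.py`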

## Usage or Command

N/A

## Accuracy Tests

N/A

## Checklist

- [x] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
  - You can add new tags based on the PR content, but the semantics must be clear.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.

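Issues

| Level | File | Summary |
| --- | --- | --- |
| 📝 PR convention | No specific line | The Checklist items [x] Provide accuracy results and [x] If the current PR is submitting to the release branch do not apply and should be deleted |
| ❓ Question | fastdeploy/cache_manager/prefix_cache_manager.py:1012 | The CACHE_HIT/CACHE_MISS trace events pass "" as the user argument instead of getattr(task, "user", ""), which is inconsistent with other calls in the same file |

Overall assessment

The PR is of good overall quality. It systematically adds traceback information to the except blocks of 20+ files, which makes exceptions in production easier to diagnose. The new CACHE_HIT/CACHE_MISS events close an observability gap for Prefix Cache in the trace log, and test coverage is fairly thorough. The only issues are PR formatting conventions and one minor inconsistency in the user argument; neither blocks merging.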

if matched_block_num > 0:
    self.metrics.hit_req_count += 1
    # Record CACHE_HIT trace event
    trace_print(LoggingEventName.CACHE_HIT, req_id, "")
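❓ Question: the user argument is passed as an empty string "", which is inconsistent with other trace_print calls in the same file (for example, WRITE_CACHE_TO_STORAGE_START uses getattr(request, "user", "")).

Here task is the Request object, so consider changing the call to:

trace_print(LoggingEventName.CACHE_HIT, req_id, getattr(task, "user", ""))

The same applies to the CACHE_MISS line.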
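
For reference, the behavior of the suggested `getattr(task, "user", "")` fallback can be sketched as follows. The `SimpleNamespace` objects stand in for the real Request object, and `resolve_user` is a hypothetical helper used only for this illustration.

```python
from types import SimpleNamespace


def resolve_user(task):
    """Return task.user when the attribute exists, otherwise an empty string."""
    return getattr(task, "user", "")


# Stand-ins for the Request object: one with a user attribute, one without.
task_with_user = SimpleNamespace(request_id="req-1", user="alice")
task_without_user = SimpleNamespace(request_id="req-2")

print(repr(resolve_user(task_with_user)))
print(repr(resolve_user(task_without_user)))
```

Note that the default applies only when the attribute is absent; if `user` exists but is `None`, `getattr` returns `None`, so callers that need a string may still want to normalize it.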
